Skip to content

fix(pt_expt): Fix compiled ckpt load key mismatch#5468

Open
anyangml wants to merge 3 commits into
deepmodeling:masterfrom
anyangml:fix/pt-expt-compile-restart
Open

fix(pt_expt): Fix compiled ckpt load key mismatch#5468
anyangml wants to merge 3 commits into
deepmodeling:masterfrom
anyangml:fix/pt-expt-compile-restart

Conversation

@anyangml
Copy link
Copy Markdown
Collaborator

@anyangml anyangml commented May 27, 2026

This resolves restart failure.

Summary by CodeRabbit

  • Bug Fixes

    • Fixed checkpoint compatibility for compiled models so saved checkpoints match the expected model state and can be resumed reliably.
    • Improved compiled-model wrapper behavior to preserve access to underlying model methods and properties.
  • Tests

    • Added tests covering compiled-model attribute delegation and checkpoint restart scenarios to prevent regressions.

Review Change Stack

Copilot AI review requested due to automatic review settings May 27, 2026 09:15
@dosubot dosubot Bot added the bug label May 27, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 21ad68a9-6cc8-40e4-9328-e6b4fefed12f

📥 Commits

Reviewing files that changed from the base of the PR and between 22e79fa and ab97b0a.

📒 Files selected for processing (1)
  • deepmd/pt_expt/train/training.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/pt_expt/train/training.py

📝 Walkthrough

Walkthrough

Adds __getattr__ to _CompiledModel to delegate unknown attribute access to its original_model, and updates Trainer.save_checkpoint() to temporarily unwrap compiled models when building the checkpoint state dict. Unit and integration tests validate delegation and restart-from-compiled checkpoints.

Changes

Compiled Model Checkpoint Serialization

Layer / File(s) Summary
_CompiledModel attribute delegation
deepmd/pt_expt/train/training.py, source/tests/pt_expt/test_training.py
_CompiledModel.__getattr__ delegates unknown attribute access to original_model. Unit tests verify method and property delegation, non-scalar returns, that wrapper-owned attributes/submodules are not delegated, and that missing attributes raise AttributeError.
Checkpoint serialization and restart
deepmd/pt_expt/train/training.py, source/tests/pt_expt/test_training.py
Trainer.save_checkpoint() swaps per-task _CompiledModel instances with their original_model before calling wrapper.state_dict(), then restores them in a finally block. Integration test trains with compile enabled, asserts the saved state dict contains unwrapped keys (no _CompiledModel wrapper-specific keys), verifies loading into an uncompiled wrapper succeeds with no missing/unexpected keys, and restarts from the compiled checkpoint with start_step restored and successful training producing lcurve output.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 47.37% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(pt_expt): Fix compiled ckpt load key mismatch' clearly describes the main change: fixing a compiled checkpoint loading key mismatch issue in the pt_expt module, which directly aligns with the code changes that address checkpoint state_dict key name mismatches.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes restart failures when training with torch.compile enabled in the pt_expt backend by ensuring compiled-model wrappers don’t leak wrapper-prefixed keys into saved checkpoints, and by making the compiled wrapper transparently expose attributes/methods of the original model.

Changes:

  • Add _CompiledModel.__getattr__ to delegate unknown attribute/method lookups to the wrapped original_model.
  • Update checkpoint saving to temporarily unwrap _CompiledModel instances so saved state_dict keys match what a fresh ModelWrapper expects.
  • Add regression/unit tests covering attribute delegation and restart-from-compiled-checkpoint behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
deepmd/pt_expt/train/training.py Adds attribute delegation for _CompiledModel and unwraps compiled models during checkpoint serialization to prevent key mismatches on restart.
source/tests/pt_expt/test_training.py Adds unit/regression tests for _CompiledModel delegation and compiled-checkpoint restart key compatibility.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread deepmd/pt_expt/train/training.py
@anyangml anyangml requested a review from wanghan-iapcm May 27, 2026 09:23
@codecov
Copy link
Copy Markdown

codecov Bot commented May 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.46%. Comparing base (4e64f8b) to head (ab97b0a).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5468   +/-   ##
=======================================
  Coverage   82.46%   82.46%           
=======================================
  Files         829      829           
  Lines       88763    88777   +14     
  Branches     4225     4225           
=======================================
+ Hits        73197    73211   +14     
+ Misses      14274    14273    -1     
- Partials     1292     1293    +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@wanghan-iapcm
Copy link
Copy Markdown
Collaborator

plz address the copilot s comment

@anyangml anyangml requested a review from wanghan-iapcm May 28, 2026 03:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants